Unsupervised machine learning framework for discriminating major variants of concern during COVID-19

Due to the high mutation rate of the virus, the COVID-19 pandemic evolved rapidly. Certain variants of the virus, such as Delta and Omicron emerged with altered viral properties leading to severe transmission and death rates. These variants burdened the medical systems worldwide with a major impact to travel, productivity, and the world economy. Unsupervised machine learning methods have the ability to compress, characterize, and visualize unlabelled data. This paper presents a framework that utilizes unsupervised machine learning methods to discriminate and visualize the associations between major COVID-19 variants based on their genome sequences. These methods comprise a combination of selected dimensionality reduction and clustering techniques. The framework processes the RNA sequences by performing a k-mer analysis on the data and further visualises and compares the results using selected dimensionality reduction methods that include principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and uniform manifold approximation projection (UMAP). Our framework also employs agglomerative hierarchical clustering to visualize the mutational differences among major variants of concern and country-wise mutational differences for selected variants (Delta and Omicron) using dendrograms. We also provide country-wise mutational differences for selected variants via dendrograms. We find that the proposed framework can effectively distinguish between the major variants and has the potential to identify emerging variants in the future.


Introduction
Coronaviruses (CoVs) consist of enclosed, positive-sense, single-stranded, and diversified Ribonucleic acid (RNA) viruses [1]. CoVs comprise major variants that occur through mutations, also known as genera, including delta, gamma, beta and alpha [2,3]. Currently variations of concern and country-wise mutational differences for selected variants via dendrograms. Our study deals with the RNA sequences of coronavirus, and hence we use k-mer analysis before applying selected dimensionality reduction and clustering methods. We also provide an open-source code framework developed in Python to further extend the study to emerging variants. We investigate the effect of the prominent dimensional reduction methods since they have certain strengths and limitations which have mostly been shown for tabular data. Hence, our major contribution is in evaluating the dimensional reduction methods with a study of visualisation produced by them for genome analysis. We organise the remaining sections of this paper as follows. Section 2 provides an overview of the framework via unsupervised machine learning for distinguishing major variants. Section 3 presents the results, and Section 4 discusses the results. Lastly, Section 5 provides a conclusion of the study.

Data
Nowadays, the global initiative on sharing Avian influenza data (GISAID) [35] is recognized as a reliable portal for prompt sharing of COVID-19 data [36]. Currently, GISAID is the largest publicly accessible platform, consisting of sequences and associated epidemiological data of over 12.1 million SARS-CoV-2 strains (https://www.gisaid.org/hcov19-variants/). Due to the tremendous effort by scientists, new SARS-CoV-2 variants of concern have been included in GISAD, such as B.1.1.7 (Alpha; first detected in the United Kingdom), B.1.617.2 (Delta; first detected in India) and B.1.1.529 (Omicron; first detected in South Africa) [37,38]. GISAID provides prompt updates to formulate crucial public health policies to control COVID-19 situations globally.
We extracted 250 randomly selected SARS-CoV-2 isolates of complete genome sequences of human origins from GISAID on 12th September 2022. The five variants (Alpha, Beta, Gamma, Delta, and Omicron) featured 50 genome sequences each. Table 1 presents the metainformation from the top 10 countries based on the number of genome isolates for the selected variants across the globe.
In addition, we extracted 250 further genome sequences each for Delta and Omicron on 16th September 2022 from GISAID to visualize the country-wise mutational differences with meta-information in Tables 2 and 3.

k-mer analysis
Data pre-processing methods such as k-mer analysis have been prominent in the analysis of genome (DNA) sequences. k-mers are substrings of genome sequences of length k and analysis is done to calculate the frequency of the substrings. A k-mer refers to all of a sequence's substring of length k; for instance, the sequence "ATGG" would have four monomers (A, T, G, and G), three 2-mers (AT, TG, GG), two 3-mers (ATG and TGG), and one 4-mer (ATGG). Effective k-mer analysis can reduce computational time for sequence processing and provide better data storage for further analysis with statistical methods [39]. k-mer analysis is extensively used in numerous bioinformatics problems, including computational genomics and sequence analysis [40] and has also been applied for COVID-19. The major challenge of k-mer analysis is in determining the value of "k" which needs to be determined experimentally for different problems. A number of packages in languages such as R and Python exist for k-mer analysis [41,42]. Typically, k-mers consisting of ambiguous bases i.e. bases not identified during sequencing, such as 'N' which represents any possible nucleotide, are deleted. After k-mer analysis, the distance between a pair or a group of sequences can be visualized using unsupervised machine learning methods. DNA is represented with bases paired on the opposite strands (double-stranded) [43] and typically sequenced on either of the two strands. We need to consider every location of the genome once, no matter which has been considered. For instance, our analysis for sequence "ATCGAC" would consider its reverse complement "GTCGAT". In canonical k-mer count, the k-mers that are reverse complements of themselves are counted twice. Typically, k-mer counting tools [41, 42,44] either count in canonical k-mers or have the option to switch between canonical and non-canonical [31]. We note that there are three types of k-mer count, which include total, unique, and distinct k-mers. Hence, distinct k-mers would be counted only once, while unique k-mers are those that appear only once. Therefore, the sequence "ATCGATCAC" in non-canonical form, would have 7 total 3-mers, 6 unique 3-mers, and 5 distinct 3-mers, respectively. We used the kmer package [45] in R with default values that used total count in con-canonical form.

Dimensionality reduction
PCA. PCA is a dimensionality reduction method extensively used in various forms of data reduction, data analysis, and data visualization with applications in computer graphics [46], machine learning [47], and bioinformatics [22,23]. The aim of PCA is to calculate the most relevant linear basis to represent a complex data set. Thus, PCA is a linear combination of the basis vectors which reduces the dimensions while retaining the most crucial information.
Another assumption of PCA is that the principal components are orthogonal. This assumption is essential as it serves as an intuitive simplification which means PCA can function with linear algebra decomposition approaches. In the field of medicine, PCA is used to solve various problems, including multicollinearity clinical studies [48]. PCA has been used to detect phenotypes in order to forecast the severity of COVID-19 and implement an individual treatment [49]. Similarly, PCA has been utilized to automatically classify five types of electrocardiogram (ECG) to detect aberrant cardiac electrical activity [50]. However, limitations of PCA exist in sparse datasets, datasets with uncorrelated features, and datasets with outliers [51].
t-SNE. t-SNE is a nonlinear dimensionality reduction method that is also used for the visualization of high-dimensional data into a low-dimensional space of two or three dimensions. t-SNE is an extension of stochastic neighbour embedding(SNE) [52] with two key modifications that include a student t-distribution rather than a Gaussian and a symmetrical form of the SNE cost function with basic gradients. t-SNE has been widely used in the domain of medicine, and bioinformatics [53] e.g. in molecular dynamics simulations of macromolecules for visualization [54], and motor behaviour in Parkinson's disease [55]. However, a major limitation of t-SNE is the visualization of the entire structure of the data and the lack of information, such as explained variance ratio that is given by PCA. Since the dimensionality reduction in t-SNE is based on local properties of the data, it could face challenges in high dimensional structure. Hence, it is important to evaluate its performance for different applications. Therefore, in this study, we compare t-SNE with other dimensionality reduction methods.
UMAP. UMAP is a manifold learning approach for dimensionality reduction which employs a conceptual structure according to the Riemannian geometry and algebraic topology [28]. UMAP has been shown to perform comparably to t-SNE in terms of visualization quality [56] and potentially retains better global structure with less computation time. Additionally, UMAP does not have computational restrictions on the dimension of embedding, making it practical as a dimension reduction approach for various problems. UMAP can be expressed in the form of weighted graphs, which places UMAP in the category of k-neighbour-based graph learning models such as Isomap [57] and t-SNE. Together with various k-neighbour graphbased models, UMAP can be expressed in two parts. In the first part, a specific weighted kneighbour graph is generated, and in the second part, a low-dimensional outline of this graph is calculated. UMAP has been successful in bioinformatics problems such as dimensionality reduction and visualization of single-cell data [58] and transcriptomics data [29].

Agglomerative clustering
Hierarchical agglomerative clustering [59], also known as agglomerative nesting (AGNES) provides a better approach by addressing the problem of k-means clustering, where k needs to be manually tuned. In an agglomerative clustering model, the clustering initiates with individual collections of every data point [60]. AGNES has been extensively used in various medical domains [61,62], such as categorizing patients with severe aortic stenosis [63], and mapping molecular substructures [64]. However, AGNES has been ineffective in some problems since finding the nearest pair of clusters can be challenging when data is sparse and noisy [65].
AGNES produces a dendrogram that visualizes the hierarchical relationship amongst the clusters. A dendrogram consists of a tree-like structure for interpretive machine learning which provides a visualisation of how the data instances are allocated to the respective clusters. Phylogenetic associations interpreted from genome sequences are conventionally presented as trees, which can also be represented using dendrograms [66].

Framework
Fig 1 presents the framework for discriminating and visualizing major COVID-19 variants based on genome (RNA) data of the virus. In the first step, we extract data from the GISAID database where we take random samples of selected variants to demonstrate the effectiveness of the framework.
In the second step, we break down the genomes into k-mers with selected values of k and evaluate the most appropriate for effective visualization via PCA in the next step. We remove any ambiguous base in the genome accordingly using the package employed by the framework [45]. We select the best value of k in k-mer analysis based on the explained variance ratio of the first two principal components of the reduced dataset. We choose the k that provides the highest value of the combined explained variance ratio. The framework also reports a scree-plot to show the explained variance ratio so that the number of principal components in PCA adequately represents the original data that can be selected.
Subsequently, in step three, we compare the selected dimensionality reduction approaches that include PCA, t-SNE, and UMAP. Note that our framework is general and other dimensional reduction approaches such as Isomap and linear discriminant analysis (LDA) can also be utilized. In this step, we compare the visualization produced by the first two components of the respective approaches for the selected COVID-19 variants.
In Step 4(a), we take the data after k-mer analysis and apply clustering via AGNES. We first visualize the mutational differences among the five variants (Alpha, Beta, Gamma, Delta, and Omicron), and then visualize the country-wise differences between the genome sequences of Delta and Omicron.
Finally, in Step 4(b), we investigate how the variants compare with others based on their country. The major motivation of this investigation is to track future variants as they are moving from country to country at different times.

Implementation
In our proposed framework, we implement k-mer analysis using the k-mer [45] R package and the scikit-learn Python package [67] for implementing the dimensional reduction methods (PCA, UMAP, t-SNE). We also use the same package to implement the clustering approach and provide visualizations using standard R libraries (ggplots). Our framework is available via the GitHub repository which is included in the data section of this paper. In our experiments, we use the Macintosh Operating System with an Apple M1 chip featuring 8-core GPU

PLOS ONE
(graphics processing units) and 8-core CPU (central processing units). Note that our framework excludes GPU and utilizes CPU computational power only.

k-mer and PCA analysis
We first investigate the optimal value of k for k-mer analysis of the selected genomes via explained variance ratio of PCA (Step 3 of framework given in Fig 1). In this way, we understand the best value obtained by different k-mer analyses, where k 2 {3, 5, 7}. We use the dataset of 250 randomly selected coronavirus sequences (Table 1) for the five variants. Fig 2 presents the scree-plot of the proportion of variance explained by the different number of principal components (PCs) obtained via PCA for different values of k in k-mer analysis. We observe that the total explained variance decreases as the value of k increases; hence, the best value is given by k = 3. Table 4 shows the top 5 principal components (PC) variance ratio for 3 selected k values. Note that k = 3 shows the highest total variance ratio; hence, this is selected for future analysis. The proportion of explained variance by the first component is 53.3% for k = 3 with a total of around 75%; however, for k = 5, the explained variance falls drastically to a total of around 44%. Similarly, when k = 7, the proportion of explained variance decreases further to around 18%. This means that the k-mer analysis with increasing values of k has an inverse relationship with the explained variance ratio.

Visualisation using dimensionality reduction methods
In the previous section, we ran PCA-based dimensionality reduction to evaluate the value of k in k-mer analysis based on explained variance ratio. Dimensionality reduction methods such as PCA can be used to visualize data via scatter plots of the first two principal components. In this way, we have a better picture of the data, giving more insight than explained variance ratio. Next, we take the same dataset, i.e., SARS-CoV-2 genome isolates from 5 distinct clusters (Table 1), and run PCA and two other dimensional reduction methods (t-SNE and UMAP), as outlined in our framework shown in Fig 1. We visualize the different dimensionality reduction methods by varying the value of k and present a two-dimensional scatter plot of the first two components. Unlike PCA, t-SNE and UMAP do not provide explained variance ratio, so it is unclear what percentage of data is represented by the first two components; however, we can visually evaluate them based on the scatter plot.
Figs 3-5 present the visualization with PCA, UMAP, and t-SNE for selected k values from k-mer analysis. Fig 3-Panel (a) shows that Omicron and Gamma variants are close and overlap each other. This is different when compared with Fig 4-Panel (a) which also shows that Omicron is isolated. Fig 5-Panel (a) shows that Omicron overlaps the Gamma variant. We note that with k = 5, only 26% of the data is represented by the first two components (Table 4), and only 10% of data is represented by the first two components of k = 7. Hence, we can say that k = 3 is the most reliable since it represents 61% of the data by the first two components. Although PCA shows a greater variance ratio for k = 3; visually, it is poor in discriminating variants when compared to k = 5 and k = 7.
In Fig 3-Panel (b), we find that there is a further separation of the variants using UMAP. In this case, the Alpha variant is separate while it is overlapping using PCA, as shown in Fig 3  -Panel (a). In Fig 3-Panel(c), we find that t-SNE is poor in discriminating the variants; however, t-SNE improves when k = 5 and k = 7 in Fig 4-Panel(c) and Fig 5-Panel(c). In the case of UMAP, these figures show that the distance between distinct clusters becomes more

PLOS ONE
Unsupervised machine learning framework  apparent (increases) as the value of k increases. This is also apparent in the case of t-SNE, where k = 5 and k = 7 provides better visualization in discriminating cluster of variants. Furthermore, Table 5 presents the computational time where PCA uses the lowest computational time followed by UMAP and t-SNE. Although this is not a problem for this study since only a small dataset is utilized (250 genome sequences), the computational time would be an issue when millions of sequences need to be processed. Note that number of features obtained after the k-mer analysis is also shown which indicates how the dataset size changes with different values of k while representing the same problem.

Clustering
We apply AGNES clustering (Step 4 of the Framework in Fig 1) and obtain a dendrogram using the original dataset consisting of 250 randomly selected SARS-CoV-2 genome isolates. Fig 6 presents the visualization obtained from the dendrogram where we can see the distinction by groups of variants. We represent each genome isolate by a data point using a horizontal line in the plot. The dendrogram demonstrates the relationship between genome isolates and comprises sequences that are classified into every cluster. The value of every sequence is according to the weighted dissimilarity computation that scientists use for clustering. In Fig 6, we note that there are some mutational differences between variants of the same type owing to high mutation rates of SARS-CoV-2. These mutational differences help in identifying the variant lineage and can be used to track the route of transmission from one region to another. However, these mutations do not drastically alter variant properties. We also notice that in certain cases, certain variants such as Beta are close to Delta which is in the top cluster that falls under the distance of less than 3.0.
Finally, we apply AGNES on a set of 300 randomly selected SARS-CoV-2 genome isolates (Table 3) of the Omicron variant and obtain a dendrogram that shows the respective countries and how they are related (Step 4(b)) of the Framework in Fig 1). Note that we added 50 randomly selected isolates to the 250 isolates selected previously to get 300 isolates. Fig 7 presents the dendrogram obtained from the visualization for Omicron where we can see that there are some mutational differences between Omicron genome sequences from different regions. As mentioned above, these give rise to different lineages but do not alter the viral phenotype. We also observe that there is a certain level of similarity between variant sequences from different regions, for example, (USA, Spain, Brunei) and (India, Morocco, South Africa). In future work, the point of origin of these similarities can be traced by doing a spatiotemporal analysis of the data. Hence, once extended, this framework has the potential to track the trend of viral spread from one region to another.
A similar country-wise analysis was also performed on 300 randomly selected SARS-CoV-2 genome isolates of the Delta variant. Fig 8 presents the dendrogram obtained from the visualization for Delta, where we observe that the level of similarity for the genome sequence among different regions was much higher than that observed for Omicron. This supports the fact that the observed number of lineages for Omicron (7) is more than that for Delta(2) [68], which further hints at a higher mutation rate for the Omicron variant.

Discussion
The main contribution of this study lies in examining the COVID-19 isolates using classical and novel dimensionality reduction and clustering methods. In general, we found that UMAP performs better than PCA and t-SNE for the given COVID-19 genome isolates. It is effective in visualizing clusters as it takes the nonlinearity of the data into account, unlike PCA, and can capture the global structure of the data better than t-SNE. We also note that k-mer analysis is an important data pre-processing step when dealing with genomics data. Our results show that the value of k plays a crucial role in capturing the features. Depending on the method (PCA, UMAP, and t-SNE), it is critical to choose the right value of k for k-mer analysis. Even though small values of k often lead to information loss, a larger value of k, while preserving important information, demands more computational resources. Thus, we conclude that it is reasonable not to go further than k = 7, as it can take further computational time (Table 5) and storage during genome sequence pre-processing. We note that k = 3 shows the highest explained variance ratio of PCA in Table 4. Fig 3 shows that k = 3 from PCA is not as good as UMAP which has been better in discriminating the variants. Hence, although k = 3 has a high explained variance ratio, it is not appropriate when we check the visualisation and compare it with k = 7 in Fig 5 where UMAP shows the best results. Hence, we can note that PCA is not an appropriate method for visualisation in this case and k = 7 is the best k-mer count value which is supported by UMAP results. UMAP is a nonlinear dimensionality reduction method that creates simplicial complexes by connecting points if the distance between them is below a threshold. UMAP uses these complexes to calculate the relative distance in the lower dimensions, unlike t-SNE, which does it randomly [28]. PCA on the other hand, cannot capture non-linear dependencies as it is a linear projection and its primary goal is to find directions that maximize the variance in the dataset. Due to these reasons, UMAP scales well given different variations in k-mer analysis and also provides a better visual representation with less computational time when compared with PCA and t-SNE (Figs 3-5). On the other hand, PCA provides further insights using explained variance ratio which in addition while UMAP gives a good overview of the data. It will be useful to have the feature of explained variance ratio in UMAP and t-SNE, this can shed more light on k-mer analysis as done by PCA.
In our framework, we utilise dendrograms via AGNES for visualizing the mutational differences and similarities among various groups. AGNES is easy to implement as it does not require prior information about the number of clusters but the time complexity is high and thus it is computationally expensive for larger and more complex datasets. Similarly, dendrograms well interpret but become less resourceful as the complexity of data increases.
Phylogeny reconstruction describes evolutionary relationships in terms of the relative recency of common ancestry which are typically represented as a branching diagram, or tree, with branches joined by nodes [69]. Phylogeny reconstruction has been prominent in studying evolutionary biology, particularly in visualising the relationship amongst genome sequences [70]; and hence, it has been used for COVID-19 isolates [71]. Our framework produces dendograms that are similar to phylogeny reconstruction, with additional visualisation using dimensional reduction methods. Phylogeny reconstruction has certain limitations which have been demonstrated for amido acid sequence data [72]. The phylogeny trees may not necessarily accurately represent the evolutionary history of the data [73]. A dendrogram is a general term for any type of phylogeny tree (scaled or unscaled). Hence, our framework is not replacing phylogeny reconstruction, but providing a means to provide additional information. We not only provide the dendrograms but also visualization by dimensionality reduction methods to distinguish variants of concern. In future work, a comparison between dendrogram and phylogeny reconstruction would highlight the strengths and limitations of the different approaches.
In future work, the proposed framework can be extended further with novel dimensionality reduction and clustering methods. Therefore, the other novel dimensionality reduction approaches such as Ivis [74] could be considered which is good in extremely large datasets. The genome data extraction using k-mer analysis can be compared with alternatives, such as strobemers [75,76] which is gaining attention in the area of genome sequence analysis.
We note that given the efficacy of the framework in distinguishing different variants of concern, the framework can be used for assisting scientists and policymakers in pandemic management. The framework can be used for large-scale temporal and spatial study of the emergence of major variants of COVID-19 in selected countries, and also globally which can help in better understanding the infection and death rate trend. This can also give an insight into the effectiveness of vaccination programs and boosters [77] for different variants. Furthermore, the framework can be used to perform a spatiotemporal analysis to study the pattern of the spread of infection from one region to another. It can also be extended to perform a similar analysis on future outbreaks (pandemics) to understand the nature of emerging variants. Finally, the web-based application can be developed using our framework that features geolocation and interactive maps (country-wise and worldwide) displaying different variants and their evolution over time.
The limitations of the study include the meta-information provided in the COVID-19 genome isolates since a large number of samples only have dates associated with the data uploaded rather than when they were taken. It is important to know meta-information such as the date and time of the samples collected in order to have further insights into the changing nature of the variants. We also need to note that the number of variants and the number of samples for each variant are magnitudes lower than the number registered in the dataset. Our framework handled a few hundred samples and could be extended to thousands of samples. However, catering millions of samples across space (geo-location) and time would be computationally intensive and parallel computing facilities would be needed.

Conclusion
We presented a framework that provides insights that can further help scientists in effectively discriminating the COVID-19 variants that rapidly change due to mutations. In our framework, we evaluated the dimensional reduction components of the framework with different methods and found that UMAP provides the best dimensionality reduction and visualization tool for the genome sequences. We showed that PCA used in conjunction with t-SNE and UMAP addresses the limitations of the latter methods since they do not provide explained variance ratio. In many applications, only visualization of the data cannot address the problem; it is critical to know the contribution of the different features which can only be known through PCA. Furthermore, the visualization of the emerging COVID-19 variants using dendrograms via clustering can provide detailed insights about their evolution which can be extended to larger datasets.